The purpose of this notebook is to give data locations, data ingestion code, and code for rudimentary analysis and visualization of COVID-19 data provided by The New York Times [NYT1].
The following steps are taken:
1. Ingest the data:
   - COVID-19 data from The New York Times, based on reports from state and local health agencies [NYT1].
   - USA county records data (FIPS codes, geo-coordinates, populations) [WRI1].
2. Merge the data.
3. Make data summaries and related plots.
4. Make corresponding geo-plots.
5. Do an “out of the box” time series forecast.
6. Analyze fluctuations around the time series trends.
Note that other, older repositories with COVID-19 data exist, e.g. [JH1, VK1].
Remark: The time series section is done for illustration purposes only. The forecasts there should not be taken seriously.
From the help page of tolower:
capwords <- function(s, strict = FALSE) {
  cap <- function(s) paste(toupper(substring(s, 1, 1)),
                           {s <- substring(s, 2); if (strict) tolower(s) else s},
                           sep = "", collapse = " ")
  sapply(strsplit(s, split = " "), cap, USE.NAMES = !is.null(names(s)))
}
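For example, applied (with the capwords definition above) to the lower-case column names used by the NYT CSV files:

```r
# Title-case each space-separated word of each string
capwords(c("date", "state", "fips", "cases", "deaths"))
# [1] "Date"  "State"  "Fips"  "Cases"  "Deaths"
```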
if( !exists("dfNYDataStates") ) {
  dfNYDataStates <- read.csv( "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv",
                              colClasses = c("character", "character", "character", "integer", "integer"),
                              stringsAsFactors = FALSE )
  colnames(dfNYDataStates) <- capwords(colnames(dfNYDataStates))
}
head(dfNYDataStates)
dfNYDataStates$DateObject <- as.POSIXct(dfNYDataStates$Date)
summary(as.data.frame(unclass(dfNYDataStates), stringsAsFactors = TRUE))
Date State Fips Cases Deaths DateObject
2020-03-28: 55 Washington : 608 53 : 608 Min. : 1 Min. : 0 Min. :2020-01-21 00:00:00
2020-03-29: 55 Illinois : 605 17 : 605 1st Qu.: 16759 1st Qu.: 363 1st Qu.:2020-07-22 00:00:00
2020-03-30: 55 California : 604 06 : 604 Median : 111110 Median : 2079 Median :2020-12-10 00:00:00
2020-03-31: 55 Arizona : 603 04 : 603 Mean : 324348 Mean : 6183 Mean :2020-12-10 03:59:37
2020-04-01: 55 Massachusetts: 597 25 : 597 3rd Qu.: 410822 3rd Qu.: 7371 3rd Qu.:2021-05-01 00:00:00
2020-04-02: 55 Wisconsin : 593 55 : 593 Max. :4654248 Max. :68106 Max. :2021-09-19 00:00:00
(Other) :30814 (Other) :27534 (Other):27534
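The as.data.frame(unclass(...), stringsAsFactors = TRUE) idiom used above converts the character columns to factors, so that summary shows level counts instead of just the column class. A minimal illustration on toy data:

```r
# Character columns: summary() only reports length/class/mode
df <- data.frame(State = c("WA", "WA", "IL"), Cases = c(1, 2, 3),
                 stringsAsFactors = FALSE)
summary(df)
# Converting the character columns to factors makes summary() show counts per level
summary(as.data.frame(unclass(df), stringsAsFactors = TRUE))
```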
Summary by state:
by( data = as.data.frame(unclass(dfNYDataStates)), INDICES = dfNYDataStates$State, FUN = summary )
Alternative summary:
Hmisc::describe(dfNYDataStates)
if( !exists("dfNYDataCounties") ) {
  dfNYDataCounties <- read.csv( "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv",
                                colClasses = c("character", "character", "character", "character", "integer", "integer"),
                                stringsAsFactors = FALSE )
  colnames(dfNYDataCounties) <- capwords(colnames(dfNYDataCounties))
}
head(dfNYDataCounties)
dfNYDataCounties$DateObject <- as.POSIXct(dfNYDataCounties$Date)
summary(as.data.frame(unclass(dfNYDataCounties), stringsAsFactors = TRUE))
Date County State Fips Cases Deaths DateObject
2021-09-03: 3251 Washington: 16899 Texas : 133550 : 15797 Min. : 0 Min. : 0.0 Min. :2020-01-21 00:00:00
2021-04-05: 3250 Unknown : 14228 Georgia : 87129 53061 : 608 1st Qu.: 172 1st Qu.: 2.0 1st Qu.:2020-08-14 00:00:00
2021-08-03: 3250 Jefferson : 14155 Virginia: 71733 17031 : 605 Median : 978 Median : 19.0 Median :2020-12-26 00:00:00
2021-08-04: 3250 Franklin : 13557 Kentucky: 64252 06059 : 604 Mean : 5821 Mean : 113.5 Mean :2020-12-25 10:05:41
2021-08-10: 3250 Jackson : 12903 Missouri: 61988 04013 : 603 3rd Qu.: 3377 3rd Qu.: 65.0 3rd Qu.:2021-05-09 00:00:00
2021-08-20: 3250 Lincoln : 12865 Illinois: 55194 06037 : 603 Max. :1444836 Max. :34072.0 Max. :2021-09-19 00:00:00
(Other) :1715883 (Other) :1650777 (Other) :1261538 (Other):1716564 NA's :39197
if( !exists("dfUSACountyData") ) {
  dfUSACountyData <- read.csv( "https://raw.githubusercontent.com/antononcube/SystemModeling/master/Data/dfUSACountyRecords.csv",
                               colClasses = c("character", "character", "character", "character", "integer", "numeric", "numeric"),
                               stringsAsFactors = FALSE )
}
head(dfUSACountyData)
summary(as.data.frame(unclass(dfUSACountyData), stringsAsFactors = TRUE))
Country State County FIPS Population Lat Lon
UnitedStates:3143 Texas : 254 WashingtonCounty: 30 01001 : 1 Min. : 89 Min. :19.60 Min. :-166.90
Georgia : 159 JeffersonCounty : 25 01003 : 1 1st Qu.: 10980 1st Qu.:34.70 1st Qu.: -98.23
Virginia: 134 FranklinCounty : 24 01005 : 1 Median : 25690 Median :38.37 Median : -90.40
Kentucky: 120 JacksonCounty : 23 01007 : 1 Mean : 102248 Mean :38.46 Mean : -92.28
Missouri: 115 LincolnCounty : 23 01009 : 1 3rd Qu.: 67507 3rd Qu.:41.81 3rd Qu.: -83.43
Kansas : 105 MadisonCounty : 19 01011 : 1 Max. :10170292 Max. :69.30 Max. : -67.63
(Other) :2256 (Other) :2999 (Other):3137
dsNYDataCountiesExtended <-
dfNYDataCounties %>%
dplyr::inner_join( dfUSACountyData %>% dplyr::select_at( .vars = c("FIPS", "Lat", "Lon", "Population") ), by = c( "Fips" = "FIPS" ) )
dsNYDataCountiesExtended
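A minimal sketch of the key-renaming join on toy data; base R's merge plays the role of dplyr::inner_join here, with by.x/by.y handling the differing key names ("Fips" vs "FIPS"; the values are made up):

```r
# Toy stand-ins for the NYT county data and the USA county records
dfCases  <- data.frame(Fips = c("01001", "01003", "99999"), Cases = c(10, 20, 5))
dfCounty <- data.frame(FIPS = c("01001", "01003"), Population = c(55869, 223234))

# Inner join on the FIPS code; the unmatched "99999" row is dropped
merged <- merge(dfCases, dfCounty, by.x = "Fips", by.y = "FIPS")
merged
#    Fips Cases Population
# 1 01001    10      55869
# 2 01003    20     223234
```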
ParetoPlotForColumns( as.data.frame(lapply(dsNYDataCountiesExtended[,c("Cases", "Deaths")], as.numeric)), c("Cases", "Deaths"), scales = "free" )
Note that in the plots in this sub-section we filter out Hawaii and Alaska.
ggplot2::ggplot(dsNYDataCountiesExtended[ dsNYDataCountiesExtended$Lon > -130, c("Lat", "Lon", "Cases")]) +
  ggplot2::geom_point( ggplot2::aes(x = Lon, y = Lat, color = log10(Cases)), alpha = 0.01, size = 0.5 ) +
  ggplot2::coord_quickmap()
The most recent versions of leaflet in RStudio have problems with the visualization below.
cf <- colorBin( palette = "Reds", domain = log10(dsNYDataCountiesExtended$Cases), bins = 10 )
m <-
leaflet( dsNYDataCountiesExtended[, c("Lat", "Lon", "Cases")] ) %>%
addTiles() %>%
addCircleMarkers( ~Lon, ~Lat, radius = ~ log10(Cases), fillColor = ~ cf(log10(Cases)), color = ~ cf(log10(Cases)), fillOpacity = 0.8, stroke = FALSE, popup = ~Cases )
m
dsNYDataCountiesExtended
An alternative to the geo-visualization is a heat-map plot.
Make a heat-map plot by sorting the rows of the cross-tabulation matrix (the rows correspond to states):
matSDC <- xtabs( Cases ~ State + Date, dfNYDataStates, sparse = TRUE)
d3heatmap::d3heatmap( log10(matSDC+1), cellnote = as.matrix(matSDC), scale = "none", dendrogram = "row", colors = "Blues") #, theme = "dark")
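If d3heatmap is unavailable, base R's stats::heatmap gives a static version of the same view. A sketch on a toy matrix standing in for matSDC:

```r
# Toy states-by-dates matrix of cumulative counts
set.seed(1)
m <- matrix(cumsum(rpois(50, 20)), nrow = 5,
            dimnames = list(paste0("State", 1:5), paste0("Day", 1:10)))

# Same log10(x + 1) scaling as above; cluster rows only (no column dendrogram)
heatmap(log10(m + 1), Colv = NA, scale = "none",
        col = hcl.colors(12, "Blues", rev = TRUE))
```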
Cross-tabulate states with dates over deaths and plot:
matSDD <- xtabs( Deaths ~ State + Date, dfNYDataStates, sparse = TRUE)
d3heatmap::d3heatmap( log10(matSDD+1), cellnote = as.matrix(matSDD), scale = "none", dendrogram = "row", colors = "Blues") #, theme = "dark")
In this section we do simple “forecasting” (not a serious attempt).
Make a time series data frame in long form:
dfQuery <-
dfNYDataStates %>%
dplyr::group_by( Date, DateObject ) %>%
dplyr::summarise_at( .vars = c("Cases", "Deaths"), .funs = sum )
dfQueryLongForm <- tidyr::pivot_longer( dfQuery, cols = c("Cases", "Deaths"), names_to = "Variable", values_to = "Value")
head(dfQueryLongForm)
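A toy illustration of the wide-to-long reshape that pivot_longer performs, sketched with base R so it runs without tidyr:

```r
dfWide <- data.frame(Date = c("2020-03-01", "2020-03-02"),
                     Cases = c(30, 53), Deaths = c(1, 2))

# Stack the two measure columns into Variable/Value pairs
dfLong <- data.frame(Date = rep(dfWide$Date, times = 2),
                     Variable = rep(c("Cases", "Deaths"), each = nrow(dfWide)),
                     Value = c(dfWide$Cases, dfWide$Deaths))
dfLong
#         Date Variable Value
# 1 2020-03-01    Cases    30
# 2 2020-03-02    Cases    53
# 3 2020-03-01   Deaths     1
# 4 2020-03-02   Deaths     2
```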
Plot the time series:
ggplot(dfQueryLongForm) +
geom_line( aes( x = DateObject, y = log10(Value) ) ) +
facet_wrap( ~Variable, ncol = 1 )
Fit using ARIMA:
fit <- forecast::auto.arima( dfQuery$Cases )
fit
Series: dfQuery$Cases
ARIMA(2,2,2)
Coefficients:
ar1 ar2 ma1 ma2
0.8617 -0.5085 -1.6136 0.8781
s.e. 0.0430 0.0463 0.0205 0.0285
sigma^2 estimated as 657099273: log likelihood=-7011.12
AIC=14032.24 AICc=14032.34 BIC=14054.27
Plot “forecast”:
plot( forecast::forecast(fit, h = 20) )
grid(nx = NULL, ny = NULL, col = "lightgray", lty = "dotted")
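Since the remark above says the forecasts should not be taken seriously, one quick sanity check is a holdout evaluation. A sketch with base R's stats::arima and a fixed order on a toy cumulative series (auto.arima, used above, selects the order automatically):

```r
set.seed(33)
# Toy cumulative-counts series: increments are a random walk around 100
x <- cumsum(100 + cumsum(rnorm(200, 0, 5)))
train <- x[1:180]
test  <- x[181:200]

# Fixed-order ARIMA(2,2,2) fit on the training part
fit <- arima(train, order = c(2, 2, 2))
fc  <- predict(fit, n.ahead = 20)

# Mean absolute percentage error on the 20 held-out points
mape <- mean(abs(fc$pred - test) / test) * 100
round(mape, 2)
```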
Fit with ARIMA:
fit <- forecast::auto.arima( dfQuery$Deaths )
fit
Series: dfQuery$Deaths
ARIMA(3,2,2)
Coefficients:
ar1 ar2 ar3 ma1 ma2
1.0055 -0.6488 -0.1742 -1.3976 0.6687
s.e. 0.0534 0.0586 0.0513 0.0406 0.0329
sigma^2 estimated as 152385: log likelihood=-4474.78
AIC=8961.55 AICc=8961.69 BIC=8987.99
Plot “forecast”:
plot( forecast::forecast(fit, h = 20) )
grid(nx = NULL, ny = NULL, col = "lightgray", lty = "dotted")
We want to see whether the time series have fluctuations around their trends and to estimate the distributions of those fluctuations. (Knowing those distributions, further studies can be done.)
This can be done efficiently with the software monad QRMon, [AAp2, AA1]. Here we load the QRMon package:
#devtools::install_github(repo = "antononcube/QRMon-R")
library(QRMon)
Here we plot the consecutive differences of the cases and deaths:
dfQueryLongForm <-
dfQueryLongForm %>%
dplyr::group_by( Variable ) %>%
dplyr::arrange( DateObject ) %>%
dplyr::mutate( Difference = c(0, diff(Value) ) ) %>%
dplyr::ungroup()
ggplot(dfQueryLongForm) +
geom_line( aes( x = DateObject, y = Difference ) ) +
facet_wrap( ~Variable, ncol = 1, scales = "free_y" )
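The grouped mutate above prepends a 0 to diff(Value) so the difference column keeps the original length. In base R terms, per variable:

```r
v <- c(1, 4, 9, 16)   # toy cumulative values
c(0, diff(v))         # daily increments, same length as v
# [1] 0 3 5 7
```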
From the plots we see that the time series are not monotonically increasing and that there are non-trivial fluctuations in the data.
Here we take the interesting part of the data (from 2020-05-01 onward):
dfQueryLongForm2 <-
dfQueryLongForm %>%
dplyr::filter( difftime( DateObject, as.POSIXct("2020-05-01")) >= 0 ) %>%
dplyr::mutate( Regressor = as.numeric(DateObject) ) # seconds since 1970-01-01; as.numeric ignores an origin argument
Here we specify a QRMon workflow that rescales the data, fits a B-spline curve to get the trend, and finds the absolute and relative errors (residuals, fluctuations) around that trend:
qrObj <-
QRMonUnit(dfQueryLongForm2 %>% dplyr::filter( Variable == "Cases") %>% dplyr::select( Regressor, Value) ) %>%
QRMonRescale(regressorAxisQ = F, valueAxisQ = T) %>%
QRMonEchoDataSummary %>%
QRMonQuantileRegression( df = 16, probabilities = 0.5 )
$Dimensions
[1] 507 2
$Summary
Regressor Value
Min. :1.588e+09 Min. :0.0000
1st Qu.:1.599e+09 1st Qu.:0.1253
Median :1.610e+09 Median :0.5160
Mean :1.610e+09 Mean :0.4584
3rd Qu.:1.621e+09 3rd Qu.:0.7773
Max. :1.632e+09 Max. :1.0000
Here we plot the fit:
qrObj <- qrObj %>% QRMonPlot(datePlotQ = T)
Here we plot absolute errors:
qrObj <- qrObj %>% QRMonErrorsPlot(relativeErrorsQ = F, datePlotQ = T)
Here is the summary:
summary( (qrObj %>% QRMonErrors(relativeErrorsQ = F) %>% QRMonTakeValue)[[1]]$Error )
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.0037112 -0.0008552 0.0000000 0.0002571 0.0007610 0.0123816
Here we plot relative errors:
qrObj <- qrObj %>% QRMonErrorsPlot(relativeErrorsQ = T, datePlotQ = T)
Here is the summary:
summary( (qrObj %>% QRMonErrors(relativeErrorsQ = T) %>% QRMonTakeValue)[[1]]$Error )
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.711690 -0.002776 0.000000 -0.004366 0.003178 0.107087
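One way to use these fluctuation summaries further is to compare their empirical quantiles with a fitted normal distribution; heavy tails (like the Min. of -1.71 above against quartiles of roughly 0.003) show up as much wider empirical quantiles. A sketch on simulated heavy-tailed residuals, since the actual errors live inside the QRMon object:

```r
set.seed(7)
errors <- rt(500, df = 3) * 0.003   # toy heavy-tailed residuals

m <- mean(errors); s <- sd(errors)
probs <- c(0.01, 0.25, 0.75, 0.99)

# Empirical vs fitted-normal quantiles; the tail quantiles (1%, 99%) differ the most
rbind(Empirical = quantile(errors, probs),
      Normal    = qnorm(probs, mean = m, sd = s))
```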
[NYT1] The New York Times, Coronavirus (Covid-19) Data in the United States, (2020), GitHub.
[WRI1] Wolfram Research Inc., USA county records, (2020), System Modeling at GitHub.
[JH1] CSSE at Johns Hopkins University, COVID-19, (2020), GitHub.
[VK1] Vitaliy Kaurov, Resources For Novel Coronavirus COVID-19, (2020), community.wolfram.com.
[AA1] Anton Antonov, “A monad for Quantile Regression workflows”, (2018), MathematicaForPrediction at WordPress.
[AAp1] Anton Antonov, Heatmap plot Mathematica package, (2018), MathematicaForPrediction at GitHub.
[AAp2] Anton Antonov, Monadic Quantile Regression Mathematica package, (2018), MathematicaForPrediction at GitHub.